You need to be online to properly view this slideshow presentation.¶

Modern Technology¶

Data Fundamentals¶

Speaker: Alexander Lacson

'Spreadsheets: What most people are familiar with'
Name Landmass Zone Area Population Language Religion Bars Stripes Colors ... Saltires Quarters Sunstars Crescent Triangle Icon Animate Text Topleft Botright
0 Afghanistan 5 1 648 16 10 2 0 3 5 ... 0 0 1 0 0 1 0 0 black green
1 Albania 3 1 29 3 6 6 0 0 3 ... 0 0 1 0 0 0 1 0 red red
2 Algeria 4 1 2388 20 8 2 2 0 3 ... 0 0 1 1 0 0 0 0 green white
3 American-Samoa 6 3 0 0 1 1 0 0 5 ... 0 0 0 0 1 1 1 0 blue red
4 Andorra 3 1 0 0 6 0 3 0 3 ... 0 0 0 0 0 0 0 0 blue red
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
189 Western-Samoa 6 3 3 0 1 1 0 0 3 ... 0 1 5 0 0 0 0 0 blue red
190 Yugoslavia 3 1 256 22 6 6 0 3 4 ... 0 0 1 0 0 0 0 0 blue red
191 Zaire 4 2 905 28 10 5 0 0 4 ... 0 0 0 0 0 1 1 0 green green
192 Zambia 4 2 753 6 10 5 3 0 4 ... 0 0 0 0 0 0 1 0 green brown
193 Zimbabwe 4 2 391 8 10 5 0 7 5 ... 0 0 1 0 1 1 1 0 green green

194 rows × 30 columns

'Also what computers are familiar with. What the computer understands as a table...'
pixel_0_0 pixel_0_1 pixel_0_2 pixel_0_3 pixel_0_4 pixel_0_5 pixel_0_6 pixel_0_7 pixel_1_0 pixel_1_1 ... pixel_6_6 pixel_6_7 pixel_7_0 pixel_7_1 pixel_7_2 pixel_7_3 pixel_7_4 pixel_7_5 pixel_7_6 pixel_7_7
0 0.0 0.0 5.0 13.0 9.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 6.0 13.0 10.0 0.0 0.0 0.0
1 0.0 0.0 0.0 12.0 13.0 5.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 11.0 16.0 10.0 0.0 0.0
2 0.0 0.0 0.0 4.0 15.0 12.0 0.0 0.0 0.0 0.0 ... 5.0 0.0 0.0 0.0 0.0 3.0 11.0 16.0 9.0 0.0
3 0.0 0.0 7.0 15.0 13.0 1.0 0.0 0.0 0.0 8.0 ... 9.0 0.0 0.0 0.0 7.0 13.0 13.0 9.0 0.0 0.0
4 0.0 0.0 0.0 1.0 11.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 2.0 16.0 4.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1792 0.0 0.0 4.0 10.0 13.0 6.0 0.0 0.0 0.0 1.0 ... 4.0 0.0 0.0 0.0 2.0 14.0 15.0 9.0 0.0 0.0
1793 0.0 0.0 6.0 16.0 13.0 11.0 1.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 6.0 16.0 14.0 6.0 0.0 0.0
1794 0.0 0.0 1.0 11.0 15.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 2.0 9.0 13.0 6.0 0.0 0.0
1795 0.0 0.0 2.0 10.0 7.0 0.0 0.0 0.0 0.0 0.0 ... 2.0 0.0 0.0 0.0 5.0 12.0 16.0 12.0 0.0 0.0
1796 0.0 0.0 10.0 14.0 8.0 1.0 0.0 0.0 0.0 2.0 ... 8.0 0.0 0.0 1.0 8.0 12.0 14.0 12.0 1.0 0.0

1797 rows × 64 columns

'... We can recognize as images'
Example taken from Plotly Online Documentation. https://plotly.com/python/imshow/#displaying-an-image-and-the-histogram-of-color-values
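To make the link between the table and the picture concrete, here is a minimal sketch. It assumes each row of the table is an 8 × 8 grayscale image flattened row by row, which is what the pixel_<row>_<col> column names suggest. The small image below is invented for illustration and is not a row from the table.

```python
ROWS, COLS = 8, 8

# Invented image: a vertical bar two pixels wide (16 = ink, 0 = background).
image = [[16 if c in (3, 4) else 0 for c in range(COLS)] for r in range(ROWS)]


def flatten(img):
    """Turn a 2-D image into one spreadsheet row keyed by pixel_<row>_<col>."""
    return {
        f"pixel_{r}_{c}": float(value)
        for r, row in enumerate(img)
        for c, value in enumerate(row)
    }


def unflatten(record, rows=ROWS, cols=COLS):
    """Rebuild the 2-D image from the flat spreadsheet row."""
    return [[record[f"pixel_{r}_{c}"] for c in range(cols)] for r in range(rows)]


record = flatten(image)
restored = unflatten(record)

print(len(record), "columns")
for row in restored:
    print("".join("#" if value > 0 else "." for value in row))
```

The same 64 numbers are a table row to the computer and a picture to us; only the arrangement changes.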
What we understand as the lyrics of Bruno Mars' Versace on the Floor...

Let's take our time tonight, girl
Above us all the stars are watchin'
There's no place I'd rather be in this world
Your eyes are where I'm lost in
Underneath the chandelier
We're dancin' all alone
There's no reason to hide
What we're feelin' inside
Right now

So, baby, let's just turn down the lights and close the door
Ooh, I love that dress, but you won't need it anymore
No, you won't need it no more
Let's just kiss 'til we're naked, baby

Versace on the floor
Ooh, take it off for me, for me, for me, for me now, girl
Versace on the floor
Ooh, take it off for me, for me, for me, for me now, girl

I'll unzip the back and watch it fall
While I kiss your neck and shoulders
No, don't be afraid to show it all
I'll be right here ready to hold you
Girl, you know you're perfect from
Your head down to your heels
Don't be confused by my smile
'Cause I ain't ever been more for real, for real

So just turn down the lights
And close the door
Ooh, I love that dress, but you won't need it anymore
No, you won't need it no more
Let's just kiss 'til we're naked, baby

Versace on the floor
Ooh, take it off for me, for me, for me, for me now, girl
Versace on the floor
Ooh, take it off for me, for me, for me, for me now, girl
Dance

It's warmin' up
Can you feel it?
It's warmin' up
Can you feel it?
It's warmin' up
Can you feel it, baby?
It's warmin' up
Oh, seems like you're ready for more, more, more
Let's just kiss 'til we're naked

Versace on the floor
Hey, baby
Take it off for me, for me, for me, for me now, girl
Versace on the floor
Ooh, take it off for me, for me, for me, for me now, girl

Versace on the floor
Floor
Floor

... the computer can only understand as a row of a spreadsheet
   floor  girl  versace  ooh  take  let  baby  kiss  need  warmin  ...  head  heel  hey  hide  hold  inside  know  alone  lose  world
0      9     8        7    7     7    5     5     4     4       4  ...     1     1    1     1     1       1     1      1     1      1

1 rows × 61 columns

Please enter three sentences
Please take your COVID vaccination. It doesn't matter which brand. You are protecting yourself and others against COVID.
   covid  brand  matter  others  please  protect  take  vaccination
0      2      1       1       1       1        1     1            1
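The word-count row for the three sentences can be reproduced with a few lines of standard-library Python. This is only a sketch: the stop-word list and the one-entry lemma table are assumptions chosen to match the row shown, and the original slides may well have used a full NLP library instead.

```python
import re
from collections import Counter

# Assumed stop words and lemmas, kept tiny for illustration.
STOP_WORDS = {"your", "it", "doesn't", "which", "you", "are",
              "yourself", "and", "against"}
LEMMAS = {"protecting": "protect"}


def bag_of_words(text):
    """Lowercase, split into words, drop stop words, count what is left."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]
    return Counter(kept)


text = ("Please take your COVID vaccination. It doesn't matter which brand. "
        "You are protecting yourself and others against COVID.")
counts = bag_of_words(text)

# Most frequent first, ties broken alphabetically.
for word, n in sorted(counts.items(), key=lambda item: (-item[1], item[0])):
    print(word, n)
```

Each distinct word becomes a column and each document becomes a row, which is how free text ends up as a spreadsheet.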
How about sound?
               Fundamental  1st Harmonic  2nd Harmonic  3rd Harmonic  4th Harmonic
Frequency            440.0           880          1320        1760.0        2200.0
Amplitude              6.5             5             3           2.4           1.0
Phase Offset           0.0             1             4           3.0           2.1

drawing
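The harmonics table is enough to rebuild the waveform by adding one sine wave per column. The sketch below assumes frequencies are in Hz and phase offsets in radians (the slide does not state units), and the 44,100 samples-per-second rate is likewise an assumption.

```python
import math

# (frequency, amplitude, phase offset) for each column of the table.
# Assumed units: Hz for frequency, radians for phase offset.
HARMONICS = [
    (440.0, 6.5, 0.0),
    (880.0, 5.0, 1.0),
    (1320.0, 3.0, 4.0),
    (1760.0, 2.4, 3.0),
    (2200.0, 1.0, 2.1),
]

SAMPLE_RATE = 44_100  # samples per second (assumed)


def sample(t):
    """Value of the combined wave at time t (in seconds)."""
    return sum(a * math.sin(2 * math.pi * f * t + p) for f, a, p in HARMONICS)


# Roughly one period of the 440 Hz fundamental.
n_samples = round(SAMPLE_RATE / 440.0)
waveform = [sample(n / SAMPLE_RATE) for n in range(n_samples)]

print(n_samples, "samples, peak", round(max(abs(v) for v in waveform), 2))
```

Fifteen numbers in a table describe the sound; the list of samples is the same sound in another table.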

What about data that an ECE graduate will see in the semiconductor industry?¶

DUT means Device Under Test
Contact test readings, PIN n to GND:

        PIN 1   PIN 2   PIN 3   PIN 4   PIN 5    PIN 6   PIN 7   PIN 8  Passed/Failed
DUT_1  5.0001  5.0004  5.0004  5.0004  5.0004  5.00040  5.0004  5.0004  Passed
DUT_2  5.0001  5.0004  5.0004  5.0004  5.0004  5.00040  5.0004  5.0004  Passed
DUT_3  5.0001  5.0004  5.0004  5.0004  5.0004  5.00040  5.0004  5.0004  Passed
DUT_4  5.0001  5.0004  5.0004  5.0004  5.0004  5.00040  5.0004  5.0004  Passed
DUT_5  5.0001  5.0004  5.0004  5.0004  5.0004  0.13234  5.0004  5.0004  Failed
DUT_6  5.0001  5.0004  5.0004  5.0004  5.0004  5.00040  5.0004  5.0004  Passed
DUT_7  5.0001  5.0004  5.0004  5.0004  5.0004  5.00040  5.0004  5.0004  Passed
DUT_8  5.0001  5.0004  5.0004  5.0004  5.0004  5.00040  5.0004  5.0004  Passed
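The Passed/Failed column can be derived from the readings by checking every pin against test limits. The sketch below uses the readings from the table; the lower and upper limits are hypothetical, since the slides do not give the real ones.

```python
# Readings from the table: eight pins per device.
GOOD = [5.0001, 5.0004, 5.0004, 5.0004, 5.0004, 5.0004, 5.0004, 5.0004]
readings = {f"DUT_{i}": list(GOOD) for i in range(1, 9)}
readings["DUT_5"][5] = 0.13234  # PIN 6 of DUT_5

# Hypothetical limits; the real test limits are not stated in the slides.
LOWER_LIMIT, UPPER_LIMIT = 4.9, 5.1


def judge(pins):
    """Return the verdict and the list of pin numbers outside the limits."""
    failing = [
        number
        for number, value in enumerate(pins, start=1)
        if not LOWER_LIMIT <= value <= UPPER_LIMIT
    ]
    return ("Failed" if failing else "Passed"), failing


results = {dut: judge(pins) for dut, pins in readings.items()}
for dut, (verdict, failing) in results.items():
    print(dut, verdict, failing)
```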

structured_unstructured.svg

All data end up becoming spreadsheets.

Tools for working with Data¶

image.png

image.png

When you receive data:¶

Inspect the condition and make corrections

Are there missing values?
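One quick way to answer this with pandas is to count the empty cells per column. The frame below is made up for illustration; only the column names are borrowed from the profile data in these slides.

```python
import pandas as pd

# Made-up rows; None marks a value the person did not fill in.
df = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["vegetarian", None, "anything", None],
    "income": [None, 80000.0, None, None],
})

missing_count = df.isna().sum()   # number of missing cells per column
missing_share = df.isna().mean()  # fraction of rows missing per column

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Columns with a high share of missing values usually need a decision: drop them, fill them in, or treat "missing" as its own category.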
Correct the data formats

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59946 entries, 0 to 59945
Data columns (total 31 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   age          59946 non-null  int64  
 1   body_type    54650 non-null  object 
 2   diet         35551 non-null  object 
 3   drinks       56961 non-null  object 
 4   drugs        45866 non-null  object 
 5   education    53318 non-null  object 
 6   essay0       54458 non-null  object 
 7   essay1       52374 non-null  object 
 8   essay2       50308 non-null  object 
 9   essay3       48470 non-null  object 
 10  essay4       49409 non-null  object 
 11  essay5       49096 non-null  object 
 12  essay6       46175 non-null  object 
 13  essay7       47495 non-null  object 
 14  essay8       40721 non-null  object 
 15  essay9       47343 non-null  object 
 16  ethnicity    54266 non-null  object 
 17  height       59943 non-null  float64
 18  income       11504 non-null  float64
 19  job          51748 non-null  object 
 20  last_online  59946 non-null  object 
 21  location     59946 non-null  object 
 22  offspring    24385 non-null  object 
 23  orientation  59946 non-null  object 
 24  pets         40025 non-null  object 
 25  religion     39720 non-null  object 
 26  sex          59946 non-null  object 
 27  sign         48890 non-null  object 
 28  smokes       54434 non-null  object 
 29  speaks       59896 non-null  object 
 30  status       59946 non-null  object 
dtypes: float64(2), int64(1), object(28)
memory usage: 14.2+ MB
0    2012-06-28-20-30
1    2012-06-29-21-41
2    2012-06-27-09-10
3    2012-06-28-14-22
4    2012-06-27-21-26
Name: last_online, dtype: object
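The last_online values are stored as text (dtype object), so date arithmetic will not work until they are converted. A minimal sketch with pandas, using the first three values shown above and a format string inferred from how they look:

```python
import pandas as pd

last_online = pd.Series(
    ["2012-06-28-20-30", "2012-06-29-21-41", "2012-06-27-09-10"],
    name="last_online",
)

# Format inferred from the values: year-month-day-hour-minute.
parsed = pd.to_datetime(last_online, format="%Y-%m-%d-%H-%M")

print(parsed.dtype)
print(parsed.dt.hour.tolist())
```

Once the column is a real datetime, you can sort by it, subtract dates, or group by hour of day.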
Are there duplicates?

(5824, 4)
category scientific_name common_names conservation_status
8 Mammal Canis lupus Gray Wolf Endangered
3020 Mammal Canis lupus Gray Wolf, Wolf In Recovery
4448 Mammal Canis lupus Gray Wolf, Wolf Endangered
29 Mammal Eptesicus fuscus Big Brown Bat Species of Concern
3035 Mammal Eptesicus fuscus Big Brown Bat, Big Brown Bat Species of Concern
3150 Bird Gavia immer Common Loon, Great Northern Diver, Great North... Species of Concern
172 Bird Gavia immer Common Loon Species of Concern
30 Mammal Lasionycteris noctivagans Silver-Haired Bat Species of Concern
3037 Mammal Lasionycteris noctivagans Silver-Haired Bat, Silver-Haired Bat Species of Concern
4465 Mammal Myotis californicus California Myotis Species of Concern
3039 Mammal Myotis californicus California Myotis, California Myotis, Californ... Species of Concern
4467 Mammal Myotis lucifugus Little Brown Myotis Species of Concern
3042 Mammal Myotis lucifugus Little Brown Bat, Little Brown Myotis, Little ... Species of Concern
37 Mammal Myotis lucifugus Little Brown Bat, Little Brown Myotis Species of Concern
337 Bird Nycticorax nycticorax Black-Crowned Night-Heron Species of Concern
4564 Bird Nycticorax nycticorax Black-Crowned Night Heron Species of Concern
3283 Fish Oncorhynchus mykiss Rainbow Trout Threatened
3081 Bird Pandion haliaetus Osprey, Western Osprey Species of Concern
104 Bird Pandion haliaetus Osprey Species of Concern
226 Bird Riparia riparia Bank Swallow Species of Concern
3185 Bird Riparia riparia Bank Swallow, Sand Martin Species of Concern
3029 Mammal Taxidea taxus American Badger, Badger Species of Concern
4457 Mammal Taxidea taxus Badger Species of Concern
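A table like the one above can be produced by asking pandas which rows share a scientific name. The sketch below uses four rows copied from the table; how the original slides selected their rows is not shown, so treat the exact calls as an assumption.

```python
import pandas as pd

species = pd.DataFrame(
    [
        ("Mammal", "Canis lupus", "Gray Wolf", "Endangered"),
        ("Mammal", "Canis lupus", "Gray Wolf, Wolf", "In Recovery"),
        ("Mammal", "Canis lupus", "Gray Wolf, Wolf", "Endangered"),
        ("Fish", "Oncorhynchus mykiss", "Rainbow Trout", "Threatened"),
    ],
    columns=["category", "scientific_name", "common_names", "conservation_status"],
)

# keep=False flags every member of a duplicated group, not only the repeats.
dupes = species[species.duplicated(subset="scientific_name", keep=False)]
print(dupes.sort_values("scientific_name"))

# One possible clean-up: keep the first row for each scientific name.
deduped = species.drop_duplicates(subset="scientific_name", keep="first")
print(deduped.shape)
```

Which row to keep is a judgment call: here the three Canis lupus rows disagree on conservation status, so keeping the first one silently discards "In Recovery".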

Exploring and Visualizing Large Datasets, Univariate Analysis¶

What if you have thousands of rows and hundreds of columns? How do you summarize them?

image.png
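Summarizing one column at a time usually means a handful of numbers for numeric columns and a frequency count for categorical ones. A minimal sketch with the standard library; both columns below are hypothetical stand-ins for columns of a large table.

```python
import statistics
from collections import Counter

# Hypothetical numeric column.
ages = [22, 35, 38, 23, 29, 29, 32, 31, 24, 37]

q1, _, q3 = statistics.quantiles(ages, n=4)  # quartile cut points
summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "stdev": statistics.stdev(ages),
    "min": min(ages),
    "q1": q1,
    "q3": q3,
    "max": max(ages),
}
for name, value in summary.items():
    print(f"{name:>6}: {value}")

# Hypothetical categorical column: count how often each value occurs.
statuses = ["single", "single", "available", "single", "seeing someone"]
frequency = Counter(statuses)
print(frequency.most_common())
```

A histogram or box plot draws these same numbers as a picture.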

Multivariate Analysis¶

Start comparing different groups, and investigate relationships and associations.

Correlation Heatmap¶

corr_grid.svg
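Behind a correlation heatmap is a grid of Pearson correlation coefficients, one for every pair of columns. The sketch below computes that grid by hand for four invented columns; a plotting library would then color each cell by its value.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Invented columns: y rises with x, z falls with x, w is unrelated to x.
columns = {
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],
    "z": [5, 4, 3, 2, 1],
    "w": [2, 9, 4, 7, 3],
}

names = list(columns)
corr = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

print("     " + "".join(f"{name:>7}" for name in names))
for a in names:
    print(f"{a:<5}" + "".join(f"{corr[a][b]:7.2f}" for b in names))
```

Values near +1 or -1 point to strong linear relationships; values near 0 mean no linear relationship, which is not the same as no relationship at all.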

The mean difference is: -19.11905597473242
The median difference is: -19.0
tstat, pval = ttest_ind(thalach_hd, thalach_no_hd)
The p-value is 3.456964908430172e-14, which is lower than the 5% threshold, so the difference is statistically significant.
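The comparison above can be sketched end to end as follows. It assumes ttest_ind is the function from scipy.stats, and the two lists are hypothetical values, not the data behind the numbers on the slide.

```python
from statistics import mean, median

from scipy.stats import ttest_ind

# Hypothetical measurements for the two groups being compared.
thalach_hd = [132, 125, 141, 120, 138, 129, 135, 127]
thalach_no_hd = [160, 152, 171, 158, 149, 165, 155, 162]

mean_diff = mean(thalach_hd) - mean(thalach_no_hd)
median_diff = median(thalach_hd) - median(thalach_no_hd)

# ttest_ind returns a (statistic, p-value) pair, so unpack both.
tstat, pval = ttest_ind(thalach_hd, thalach_no_hd)

ALPHA = 0.05  # the 5% significance threshold
print("The mean difference is:", mean_diff)
print("The median difference is:", median_diff)
print("P-value is", pval)
print("Significant" if pval < ALPHA else "Not significant")
```

A small p-value says the gap between the groups is unlikely to be chance alone; it does not say how large or how important the gap is, which is why the mean and median differences are reported alongside it.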
Groups and Subgroups: Stratification

Artificial Intelligence¶

timeline-2.svg

image.png

ChooseMLmodel.png

Training and Evaluation of a Machine Learning AI¶

OKCupid Data Gender Classifier
Regression Example
Clustering Example
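Whatever the model, training and evaluation follow the same loop: hold out some data, fit candidate models on the rest, and score them on the held-out part. The sketch below shows that loop with two deliberately simple classifiers on made-up data; it is not the classifier from the OKCupid example.

```python
import random

random.seed(0)  # fixed seed so the experiment is repeatable

# Made-up data: one numeric feature, two classes centred at 160 and 178.
data = [(random.gauss(160, 6), "A") for _ in range(100)]
data += [(random.gauss(178, 6), "B") for _ in range(100)]
random.shuffle(data)

# Hold out 20% of the rows; the models never see them during training.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


def fit_baseline(rows):
    """Always predict the most common training label."""
    labels = [label for _, label in rows]
    majority = max(sorted(set(labels)), key=labels.count)
    return lambda x: majority


def fit_centroid(rows):
    """Predict the class whose training mean is closest to x."""
    centres = {}
    for label in sorted({label for _, label in rows}):
        values = [x for x, y in rows if y == label]
        centres[label] = sum(values) / len(values)
    return lambda x: min(centres, key=lambda lab: abs(x - centres[lab]))


def accuracy(model, rows):
    """Fraction of rows the model labels correctly."""
    return sum(model(x) == y for x, y in rows) / len(rows)


candidates = {"baseline": fit_baseline, "centroid": fit_centroid}
scores = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The baseline matters: a model is only worth keeping if it clearly beats the trivial guess on data it has not seen.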

If you have a lot of data in your organization, you can harness the power of artificial intelligence.

Data-driven decision making, Business Intelligence, Data Science¶

Slide2.jpg

Slide3.jpg

AI Analytics Case Studies¶

Cambridge Analytica. Psychometrics
Openness Conscientiousness Extraversion Agreeableness Neuroticism Age Sex Photos_Qty Will_vote_for
Person_1 0 0 0 0 0 0 0 0 Duterte
Person_2 0 0 0 0 0 0 0 0 Roxas
Person_3 0 0 0 0 0 0 0 0 Duterte
Person_4 0 0 0 0 0 0 0 0 Roxas
Person_5 0 0 0 0 0 0 0 0 Duterte
Person_6 0 0 0 0 0 0 0 0 Roxas
Person_7 0 0 0 0 0 0 0 0 Duterte
Person_8 0 0 0 0 0 0 0 0 Roxas

image.png

image.png

Qualifying for a Loan
Age Sex Marital Status Age of Spouse Occupation Educational Attainment Income Has a Car? Number of Dependents Will_fail_to_pay_loan
Customer_1 0 0 0 0 0 0 0 0 0 20% Chance
Customer_2 0 0 0 0 0 0 0 0 0 60% Chance
Customer_3 0 0 0 0 0 0 0 0 0 20% Chance
Customer_4 0 0 0 0 0 0 0 0 0 60% Chance
Customer_5 0 0 0 0 0 0 0 0 0 20% Chance
Customer_6 0 0 0 0 0 0 0 0 0 60% Chance
Customer_7 0 0 0 0 0 0 0 0 0 20% Chance
Customer_8 0 0 0 0 0 0 0 0 0 60% Chance
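A column like Will_fail_to_pay_loan is typically the output of a classifier that turns a row of features into a probability. The sketch below uses a logistic function over three of the columns above. The weights, the bias, and the choice to express income in thousands are all hypothetical; a real model would learn its weights from past loans.

```python
import math

# Hypothetical weights for illustration only.
WEIGHTS = {"income_thousands": -0.04, "dependents": 0.35, "has_car": -0.5}
BIAS = 0.2


def default_probability(customer):
    """Squash a weighted sum of the features into a value between 0 and 1."""
    score = BIAS + sum(weight * customer[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical customers.
customers = {
    "Customer_1": {"income_thousands": 60, "dependents": 1, "has_car": 1},
    "Customer_2": {"income_thousands": 20, "dependents": 3, "has_car": 0},
}

probabilities = {name: default_probability(row) for name, row in customers.items()}
for name, p in probabilities.items():
    print(f"{name}: {p:.0%} Chance")
```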
Education
Algebra Calculus Vectors Analytical Geometry Differential Equations Feedback and Control Systems Numerical Methods Solid Mensuration Will_fail_Advanced_Math
Student_1 0 0 0 0 0 0 0 0 30% Chance
Student_2 0 0 0 0 0 0 0 0 60% Chance
Student_3 0 0 0 0 0 0 0 0 30% Chance
Student_4 0 0 0 0 0 0 0 0 60% Chance
Student_5 0 0 0 0 0 0 0 0 30% Chance
Student_6 0 0 0 0 0 0 0 0 60% Chance
Student_7 0 0 0 0 0 0 0 0 30% Chance
Student_8 0 0 0 0 0 0 0 0 60% Chance

Recap¶

  • Unstructured and Structured Data. All Data are Spreadsheets.
  • There are lots of different software tools for working with data.
  • When you receive data, first inspect its condition and make corrections.
  • Explore data and make reports using visualizations.
  • Look for patterns and associations. Compare different groups, and investigate relationships.
  • Machine Learning involves running experiments. You select, evaluate, and refine different models.
  • Managers today are expected to know how to work with and understand data, because of the focus on data-driven decision making.
  • The Data Science and Business Intelligence Process starts with a question and ends with a decision. You must ask the right question and use the findings to make intelligent decisions.
  • AI is now used across many domains. The case studies shown in the talk were: Psychometrics, Sports Analytics, Stock Market Prediction, Qualifying for a Loan, and Education.

Suggested Viewing¶

  • Moneyball starring Brad Pitt

About Me
Alexander Lacson
Data Scientist with a background in Electronics
Contact lacsonalexanderz@gmail.com

Portfolio Projects

  • Life Expectancy and GDP. It Matters More for Developing Nations
  • Assigning Priority Levels to Endangered Species As a Data Scientist
  • OKCupid Exploratory Data Analysis
  • OKCupid Machine Learning Gender Classification Model
  • Dating Pools using K-Means Clustering